AITopics

Country:

Asia > China > Shanghai > Shanghai (0.05)
Asia > China > Beijing > Beijing (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Neural Information Processing Systems Feb-9-2026, 15:35:56 GMT

Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from CommonCrawl.
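
The mining step described in this abstract can be illustrated with a minimal sketch. It assumes audio and text have already been encoded into one shared vector space; the cosine metric, the threshold of 0.8, the toy vectors, and the names `cosine` and `mine_pairs` are illustrative assumptions, not the paper's actual metric or pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(audio_embs, text_embs, threshold=0.8):
    """For each audio embedding, keep its most similar text embedding
    if the similarity reaches the (hypothetical) threshold."""
    if not text_embs:
        return []
    pairs = []
    for i, a in enumerate(audio_embs):
        scores = [cosine(a, t) for t in text_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

# Toy embeddings standing in for encoder outputs (assumed, not real data).
audio = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]
text = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.1], [0.5, 0.5, 0.5]]
pairs = mine_pairs(audio, text)
```

At the scale mentioned in the abstract (billions of sentences), an exhaustive loop like this would presumably be replaced by an approximate nearest-neighbour index.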

artificial intelligence, machine learning, natural language, (19 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Spain (0.04)
Europe > Austria > Styria > Graz (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.49)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.48)

Neural Information Processing Systems Dec-23-2025, 19:27:47 GMT

Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling

Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative interference across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains. Evaluated in various tasks including speech recognition, text-to-text and speech-to-text translation, the proposed attention sharing strategies consistently bring gains to sequence models built upon multi-head attention.

attention, multilingual and multi-domain sequence modeling, pay better attention, (7 more...)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.64)

arXiv.org Artificial Intelligence Nov-13-2025

POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation

Li, Xuanchen, Cui, Chenrui, Wang, Tianrui, Ge, Meng, Huang, Zikang, Li, Jin, Peng, Yizhou, Wang, Longbiao, Dang, Jianwu, Tashi, Nyima

Speech Large Language Models (SpeechLLMs) have achieved breakthroughs in multilingual speech-to-text translation (S2TT). However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport (OT), designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations across languages. Second, we impose token-level OT constraints on a Q-Former using parallel speech pairs to establish fine-grained consistency of representations. Then, we apply a layer scheduling strategy to focus OT constraints on the most semantically beneficial layers. Experiments on the FLEURS dataset show that our method achieves SOTA performance, with +0.93 BLEU on average over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.
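
To make the idea of a token-level OT constraint concrete, here is a minimal sketch of a generic entropy-regularised OT cost (Sinkhorn iterations) between two short token sequences. It is not POTSA's loss: the squared-Euclidean cost, uniform token weights, `eps`, the iteration count, and the name `sinkhorn_cost` are all assumptions chosen for illustration.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two token vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sinkhorn_cost(xs, ys, eps=0.1, iters=200):
    """Entropy-regularised OT cost between token sequences xs and ys,
    with uniform mass on every token (an assumption of this sketch)."""
    n, m = len(xs), len(ys)
    cost = [[sq_dist(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))

# Toy "token representations" for one utterance in two languages (assumed).
src = [[0.0, 0.0], [1.0, 0.0]]
aligned = [[0.0, 0.0], [1.0, 0.0]]
shifted = [[0.0, 1.0], [1.0, 1.0]]
near = sinkhorn_cost(src, aligned)
far = sinkhorn_cost(src, shifted)
```

A training objective built on such a cost would push representations of parallel utterances toward the low-cost case; very large distances relative to `eps` can underflow in this plain formulation, which a log-domain variant avoids.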

artificial intelligence, large language model, natural language, (18 more...)

2511.09232

Country: Asia > China > Tibet Autonomous Region (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

Neural Information Processing Systems Oct-9-2025, 05:32:30 GMT

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks.

artificial intelligence, arxiv preprint arxiv, natural language, (17 more...)

Country:

Asia > China > Shanghai > Shanghai (0.05)
Asia > China > Beijing > Beijing (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Romero-Díaz, Jacobo, Gállego, Gerard I., Pareras, Oriol, Costa, Federico, Hernando, Javier, España-Bonet, Cristina

Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

arXiv.org Artificial Intelligence Oct-6-2025

Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.

artificial intelligence, machine learning, translation, (15 more...)

2510.03115

Country: Europe > Spain (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Gállego, Gerard I., Pareras, Oriol, Garcia, Martí Cortada, Takanori, Lucas, Hernando, Javier

Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios

arXiv.org Artificial Intelligence Sep-30-2025

We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.

machine learning, natural language, translation, (19 more...)

doi: 10.21437/Interspeech.2025-1954

2505.24691

Country:

North America > United States (0.68)
Europe (0.68)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence Sep-22-2025

Direct Simultaneous Translation Activation for Large Audio-Language Models

Zhang, Pei, Wang, Yiming, Tang, Jialong, Yang, Baosong, Wang, Rui, Wong, Derek F., Huang, Fei

Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time, outputting translations while receiving source speech input, rather than waiting for the entire utterance to be spoken. Simul-S2TT research often modifies model architectures to implement read-write strategies. However, with the rise of large audio-language models (LALMs), a key challenge is how to directly activate Simul-S2TT capabilities in base models without additional architectural changes. In this paper, we introduce Simultaneous Self-Augmentation (SimulSA), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data by randomly truncating speech and constructing partially aligned translation. By incorporating them into offline SFT data, SimulSA effectively bridges the distribution gap between offline translation during pretraining and simultaneous translation during inference. Experimental results demonstrate that augmenting only about 1% of the simultaneous data, compared to the full offline SFT data, can significantly activate LALMs' Simul-S2TT capabilities without modifications to model architecture or decoding strategy.
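
The data construction described above (random truncation plus a partial translation, mixed into offline data at a small rate) can be sketched roughly as follows. The abstract does not say how the partial translation is produced, so the length-proportional prefix used here is a crude stand-in; the truncation range, the names `truncate_example` and `augment`, and the toy data are likewise assumptions.

```python
import random

def truncate_example(samples, target_tokens, rng):
    """Cut the audio at a random point and keep a proportional prefix of the
    translation (a hypothetical stand-in for the partially aligned target)."""
    ratio = rng.uniform(0.2, 0.9)
    cut_audio = samples[: max(1, int(len(samples) * ratio))]
    cut_target = target_tokens[: max(1, int(len(target_tokens) * ratio))]
    return cut_audio, cut_target

def augment(offline_data, fraction=0.01, seed=0):
    """Return the offline data plus truncated copies of about `fraction` of it."""
    rng = random.Random(seed)
    n_extra = max(1, round(len(offline_data) * fraction))
    chosen = rng.sample(offline_data, n_extra)
    extra = [truncate_example(audio, target, rng) for audio, target in chosen]
    return offline_data + extra

# Toy offline set: 200 identical (audio samples, target tokens) pairs (assumed).
offline = [(list(range(100)), ["w%d" % k for k in range(10)])] * 200
mixed = augment(offline)
```

Here `fraction=0.01` mirrors the roughly 1% figure quoted in the abstract; the rest of the recipe should be read as an illustration of the idea only.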

artificial intelligence, natural language, translation, (16 more...)

2509.15692

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial Intelligence Aug-18-2025

Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation

Le, Chenyang, Xia, Yinfeng, Li, Huiyan, Wang, Manhong, Sun, Yutao, Ma, Xingyang, Qian, Yanmin

Recent advancements in speech-to-text translation have led to the development of multilingual models capable of handling multiple language pairs simultaneously. However, these unified models often suffer from large parameter sizes, making it challenging to balance inference efficiency and performance, particularly in local deployment scenarios. We propose an innovative Parasitic Dual-Scale Approach, which combines an enhanced speculative sampling method with model compression and knowledge distillation techniques. Building on the Whisper Medium model, we enhance it for multilingual speech translation into whisperM2M, and integrate our novel KVSPN module, achieving state-of-the-art (SOTA) performance across six popular languages with improved inference efficiency. KVSPN enables a 40% speedup with no BLEU score degradation. Combined with distillation methods, it represents a 2.6× speedup over the original Whisper Medium with superior performance.

artificial intelligence, natural language, translation, (16 more...)